-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Centralizing the emoji keyword generation logic #379
Centralizing the emoji keyword generation logic #379
Conversation
Thank you for the pull request!The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you! Maintainer checklist |
CC @andrewtavis @KesharwaniArpita @catreedle @DeleMike @VNW22. Please do review and contribute. Thank you |
""" | ||
Generate emoji keywords for a specified language, with optional support for grouped languages (e.g., Hindustani = Hindi + Urdu). | ||
|
||
Parameters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a start here, @Ekikereabasi-Nk, could I ask you to change the function docstrings to match those in the rest of the project :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright @andrewtavis will do so. Thank you.
- sub_languages (list): A list of specific sub-languages for grouped languages (e.g., ["Hindi", "Urdu"]). If not provided, all sub-languages in the group will be processed. | ||
""" | ||
|
||
# Define grouped languages and their sub-languages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And own line comments should have a period at the end of them. Inline comments don't need capitalization of the first word or punctuation, but those that are their own line are sentences.
|
||
def generate_emoji_keyword( | ||
language, | ||
emojis_per_keyword, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest setting emojis_per_keyword=3 as a default value. This way, we can reduce the need to specify the parameter each time. I think for most cases 3 would be a suitable default. Of course, this would still allow customization when necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed :)
# Set up the argument parser | ||
parser = argparse.ArgumentParser( | ||
description="Generate emoji keywords for a specific language." | ||
) | ||
parser.add_argument( | ||
"--file-path", required=True, help="Path to save the generated emoji keywords." | ||
) | ||
parser.add_argument( | ||
"--sub-languages", | ||
nargs="*", | ||
help="List of specific sub-languages to process (e.g., Hindi Urdu). If omitted, all sub-languages will be processed.", | ||
) | ||
parser.add_argument( | ||
"--gender", | ||
choices=["male", "female", "neutral"], | ||
help="Specify the gender for emoji customization.", | ||
) | ||
parser.add_argument( | ||
"--region", help="Specify the region for emoji customization (e.g., US, IN)." | ||
) | ||
|
||
# Parse the command-line arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we set up a common argument parsing function for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @catreedle I will take note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
welcome! I'm not really sure of the differences between individual languages to decide if it is plausible, but it is something to consider. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@catreedle this will mean to create a modular function for the common parser file, most likely in a separate file that can be imported to each generate_emoji_keyword
file for all language. Also @SethiShreya mention the is a parser function somewhere within the codebase, I have not seen it. Please you can help point it out for me to study. So I will create another file in the Unicode folder for this parser function related to generate_emoji_keyword
for each language. Thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find it either...
maybe something like this could work?
def setup_argument_parser():
parser = argparse.ArgumentParser(description="Generate emoji keywords for a specific language.")
parser.add_argument(
"--file-path", required=True, help="Path to save the generated emoji keywords."
)
...
return parser
254e9dc
to
3c41c6d
Compare
Quick note being sent to all the testing PRs, if updates are needed now that #402 has been merged, then it'd be great to get those updates to the branch :) If no updates are needed, then let me know 😊 |
hey @Ekikereabasi-Nk @andrewtavis @catreedle Arabic Please feel free to recheck my findings or let me know if I missed anything. |
Amazing, @VNW22! What's the code for this like? Would we be able to make a workflow to make sure that there's no directory for emoji keywords directory for those language that don't have support? Might be something to add to the project structure check :) Are there also languages that do have support that we currently don't have in the project? |
This is the python script i used; def find_languages_with_emoji_support(root_dir):
root_directory = "./cldr-json/cldr-localenames-full/" languages = find_languages_with_emoji_support(root_directory) The code for finding languages without was not working, so I opted for this. |
Nice 😊 Let's definitely add this to the project structure workflow check once this is done. We also have functionality to update the CLDR data, so maybe new languages will be added and then this workflow will tell us to add emoji support 😇 |
520d2b1
to
281b735
Compare
|
def setup_arg_parser(): | ||
parser = argparse.ArgumentParser( | ||
description="Generate emoji keywords for a specific language." | ||
) | ||
parser.add_argument( | ||
"--file-path", required=True, help="Path to save the generated emoji keywords." | ||
) | ||
parser.add_argument( | ||
"--sub-languages", | ||
nargs="*", | ||
help="List of specific sub-languages to process (e.g., Hindi Urdu). If omitted, all sub-languages will be processed.", | ||
) | ||
parser.add_argument( | ||
"--gender", | ||
choices=["male", "female", "neutral"], | ||
help="Specify the gender for emoji customization.", | ||
) | ||
parser.add_argument( | ||
"--region", help="Specify the region for emoji customization (e.g., US, IN)." | ||
) | ||
parser.add_argument( | ||
"--emojis-per-keyword", | ||
type=int, | ||
help="Number of emojis to generate per keyword.", | ||
) | ||
return parser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks great! thank you for taking my suggestion into account. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @Ekikereabasi-Nk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work!!!
85767b1
to
f602dc3
Compare
Hi @andrewtavis @SethiShreya @KesharwaniArpita please review. Also check #440 if they are compatible. Thank you |
This function serves as a foundational component for generating | ||
emojis tailored to specific user demographics, and implementing it in a | ||
modular fashion will support future enhancements and maintenance. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes Sense!! Does that mean there's a lot more scope for emoji generation by expanding the demographics? Just confirming. Actually sounds interesting. Can we do anything in reference to this? Maybe expand the emoji generation? What do you say @Ekikereabasi-Nk and @andrewtavis .
Thank you @Ekikereabasi-Nk for the work.
@Ekikereabasi-Nk I was reading the generate_emoji_functionality file it still have args, and one more thing I saw is you are generating emoji_keywords for all the languages right? Why? Wasnt we suppose to get the parameters from main file and then generate for the language provided, @andrewtavis did I understand it right? |
Closing this in favor of #440, @Ekikereabasi-Nk :) Really appreciate the work here! Was just a bit easier to integrate the other PR as it directly addressed the current functionality without expanding into new features. |
Contributor checklist
Description
This PR addresses issue #359 by centralizing the emoji keyword generation logic and simplifying the language-specific scripts to enhance maintainability, flexibility, and modularity across the project. The changes are structured into parts, focusing on minimizing code duplication and improving support for language-specific customizations like gender, region and grouped Languages (Hindustani (which contains both Urdu and Hindi) or Norwegian (which includes both Bokmål and Nynorsk) suggested by @KesharwaniArpita.
Centralize the Core Logic
The core functionality for generating emoji keywords, which is duplicated across most language folders, has now been moved to a single centralized file:
src/scribe_data/unicode/generate_emoji_keyword.py
This file now handles the core logic, including the generation of emoji keywords and the export of formatted data. The centralized approach ensures that all shared functionality resides in one place, reducing the risk of inconsistencies.
Centralized Function: The
generate_emoji_keyword
function now resides in this centralized location. This functiontakes the
language
,emojis_per_keyword
, andfile_path
as core parameters and performs the keyword generation bycalling
gen_emoji_lexicon
and subsequently exporting the data usingexport_formatted_data.
Shared Logic: The logic for generating the emoji lexicon and handling the export process has been consolidated into
this centralized file, eliminating duplication of this logic across the language-specific modules.
Handle Language-Specific Customizations (Including Grouped Languages)
The
generate_emoji_keyword
function has been enhanced to accommodate language-specific customizations, including grouped languages.Grouped Language Support: Languages like Hindustani (which contains both Urdu and Hindi) and Norwegian (which includes both Bokmål and Nynorsk) are now available.
Customizations for Gender and Region:
gender
andregion
are now available.regional differences like US or JP variants).
Modify Language-Specific Files
Each language emoji_keyword folder with a generate_emoji_keyword.py file under language_data_extraction can be updated by using:
generate_emoji_keyword.py
from thesrc/scribe_data/unicode/generate_emoji_keyword.py
fromfrom scribe_data.unicode.generate_emoji_keyword import generate_emoji_keyword
.Refactor process_unicode.py
The process_unicode.py file has been refactored to integrate the centralized generate_emoji_keyword logic:
Centralized Emoji Lexicon Generation: Shared logic for emoji generation now resides in the
gen_emoji_lexicon
function, making it more consistent across language modules.Support for Grouped Languages: The function handles grouped languages like Hindustani (Hindi and Urdu) and Norwegian (Bokmål and Nynorsk), processing sub-languages individually.
Gender and Region Customizations: Parameters for gender and region allow emoji customization based on cultural and demographic factors.
#ISSUE_NUMBER
-Simplify project's emoji keyword functionality #359